Abstract. This paper extends k-means algorithms from the Euclidean
domain to the domain of graphs. To recompute the centroids, we apply 
subgradient methods for solving the optimization-based formulation of 
the sample mean of graphs. To accelerate the k-means algorithm for 
graphs without trading computational time against solution quality, we 
avoid unnecessary graph distance calculations by exploiting the triangle 
inequality of the underlying distance metric following Elkan's k-means 
algorithm proposed in [5]. In experiments we show that the accelerated 
k-means algorithm is faster than the standard k-means algorithm for
graphs provided there is a cluster structure in the data. 

1 Introduction 

The k-means algorithm is a popular clustering method because of its simplicity 
and speed. The algorithmic formulation of k-means as well as the solutions of its 
cluster objective presuppose the existence of a sample mean. Since the concept 
of sample mean is well-defined for vector spaces only, application of the k-means 
algorithm has been limited to patterns represented by feature vectors. But often, 
the objects we want to cluster have no natural representation as feature vectors 
and are more naturally represented by finite combinatorial structures such as, 
for example, point patterns, strings, trees, and graphs arising from diverse ap- 
plication areas like proteomics, chemoinformatics, and computer vision. 

For combinatorial structures, pairwise clustering algorithms are among the
most widely used methods to partition a given sample of patterns, because they
can be applied to patterns from any distance space without any additional math- 
ematical structure. Related to k-means, the k-medoids algorithm is a well-known 
alternative that can also be applied to patterns from an arbitrary distance space. 
The k-medoids algorithm operates like k-means, but replaces the concept of mean 
by the set median of a cluster [23]. With the emergence of the generalized me- 
dian [18,6] and sample mean of graphs [13,14], variants of the k-means algorithm 
have been extended to the domain of graphs [13, 6, 14]. 

In an unmodified form, however, pairwise clustering, k-medoids as well as the 
extended k-means algorithm are slow in practice for large datasets of graphs. The 
main obstacle is that determining a graph distance is well known to be a graph 
matching problem of exponential complexity. But even if we resort to graph 
matching algorithms that approximate graph distances in polynomial time, ap- 
plication of clustering algorithms for large datasets of graphs is still hindered by 
their prohibitive computational time. 



For pairwise clustering, the number of NP-hard graph distance calculations
depends quadratically on the number of input patterns. In the worst case,
when almost all patterns are in one cluster, k-medoids also has quadratic
complexity in the number of distance calculations. If the N graph patterns are
uniformly distributed in k clusters, k-medoids requires $O(tN^2/k)$ graph distance
calculations, where t is the number of iterations required. For k-means, we require
kN graph distance calculations at each iteration in order to assign N pattern
graphs to their closest centroids. Recomputing the centroids requires additional
graph distance calculations. In the best case, when using the incremental
arithmetic mean method [15] for approximating a sample mean, N graph distance
calculations at each iteration are necessary to recompute the centroids. This
gives a total of $t(k + 1)N$ graph distance calculations, where t is the number of
iterations required. In view of the exponential complexity of the graph matching
problem, reducing the number of distance calculations is imperative to make
k-means for graphs applicable in practice.
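To make these counts concrete, the following Python sketch evaluates them for given N, k, and t. The function names are hypothetical, and the pairwise count N(N-1)/2 is our assumption (one distance per unordered pair); the text only states that the dependence is quadratic.

```python
def pairwise_count(n):
    # Assumption: one distance calculation per unordered pair of patterns.
    return n * (n - 1) // 2


def kmedoids_count(n, k, t):
    # Order-of-magnitude count t * N^2 / k for k clusters of equal size.
    return t * n * n // k


def kmeans_count(n, k, t):
    # Per iteration: k*N distances for the assignment step plus N distances
    # for recomputing the centroids with the incremental arithmetic mean.
    return t * (k + 1) * n
```

Comparing `kmedoids_count` with `kmeans_count` for the same N, k, and t shows how strongly the k-medoids count grows with N once the clusters become large.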

In this contribution, we propose an accelerated version of k-means for graphs 
by extending Elkan's method [5] from vectors to graphs. For this we assume that
the underlying graph distance is a metric. To avoid computationally expensive 
graph distance calculations, we exploit the triangle inequality by keeping track 
of upper and lower bounds between input graphs and centroids. 

The k-means algorithm for graphs generalizes the standard k-means algo- 
rithm for vectors. Regarding feature vectors as graphs consisting of a single 
attributed node, k-means for graphs coincides with k-means for vectors. The 
proposed accelerated version of k-means for graphs has the following properties: 
First, based on the T-space framework, accelerated k-means can be applied to
finite combinatorial structures other than graphs such as point patterns,
sequences, trees, and hypergraphs. For the sake of concreteness, we restrict our
attention exclusively to the domain of graphs. Second, any initialization method
that can be used for k-means for graphs can also be used for Elkan's k-means
for graphs. Third, k-means for graphs and its accelerated version perform
comparably with respect to solution quality. Different solutions are due to the
approximation errors of the graph matching algorithm and the non-uniqueness
of the sample mean of graphs but are not caused by the mechanisms to accelerate
the clustering algorithm.

The paper is organized as follows. Section 2 briefly describes the standard
k-means algorithm for vectors. Section 3 extends the standard k-means from
vectors to graphs. Section 4 introduces Elkan's k-means algorithm for graphs. 
Experimental results are presented and discussed in Section 5. Finally, Section 
6 concludes with a summary of the main results and future work. 

2 The k-Means Algorithm for Euclidean Spaces 

This section describes k-means for vectors [24] in order to point out commonal- 
ities and differences with k-means for graphs. 



Algorithm 1 (K-Means Algorithm for Euclidean Spaces)

01 choose initial centroids $y_1, \ldots, y_k \in \mathcal{X}$
02 repeat
03     assign each $x \in S$ to its closest centroid $y_x = \arg\min_{y \in \mathcal{Y}} \|x - y\|^2$
04     recompute each centroid $y \in \mathcal{Y}$ as the mean of all vectors from $C(y)$
05 until some termination criterion is satisfied



Suppose that we are given a training sample $S = \{x_1, \ldots, x_N\}$ of $N$ vectors
drawn from the Euclidean space $\mathcal{X}$. A partition $\mathcal{P} = \{C_1, \ldots, C_k\}$ of $S$ into $k$
disjoint subsets (clusters) $C_j \subseteq S$ is determined by an $(N \times k)$-membership matrix
$M = (m_{ij})$ satisfying the constraints

$$\sum_{j=1}^{k} m_{ij} = 1 \quad \text{for all } i \in \{1, \ldots, N\},$$

$$m_{ij} \in \{0, 1\} \quad \text{for all } i \in \{1, \ldots, N\},\ j \in \{1, \ldots, k\}.$$

The standard k-means clustering algorithm aims at finding $k$ centroids
$\mathcal{Y} = \{y_1, \ldots, y_k\} \subseteq \mathcal{X}$ and a partition $M = (m_{ij})$ of the set $S$ such that the cluster
objective

$$J(M, \mathcal{Y} \mid S) = \sum_{j=1}^{k} \sum_{i=1}^{N} m_{ij} \, \|x_i - y_j\|^2$$

is minimized.

Suppose that we fix an arbitrary membership matrix $M$. Then the cluster
objective $J(\cdot \mid M, S)$ given $M$ and $S$ is differentiable as a function of the
centroids $\mathcal{Y}$. The $k$ centroids that minimize $J(\cdot \mid M, S)$ are the sample means

$$y_j = \frac{1}{|C_j|} \sum_{i=1}^{N} m_{ij} \, x_i$$

of the clusters $C_j$ consisting of data points $x \in S$ assigned to centroid $y_j$. Since
the $k$ sample mean centroids together with the given membership matrix $M$
yield only a local minimum of the cluster objective $J$, the challenging task
of minimizing $J$ consists in finding an optimal membership matrix. Since this
problem is NP-complete [7], several heuristic algorithms have been devised. A
standard clustering heuristic for minimizing $J$ is the k-means algorithm as
outlined in Algorithm 1. The notation $C(y)$ used in Algorithm 1 denotes the cluster
associated with centroid $y \in \mathcal{Y}$.
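The two alternating steps of Algorithm 1 can be sketched in a few lines of Python with NumPy. This is a minimal illustration, not a reference implementation: the function name `kmeans`, the stopping rule (stop when the assignment no longer changes), and the handling of empty clusters (keep the old centroid) are our assumptions.

```python
import numpy as np


def kmeans(S, init_centroids, max_iter=100):
    """Alternate assignment and mean recomputation for vectors."""
    S = np.asarray(S, dtype=float)
    Y = np.asarray(init_centroids, dtype=float).copy()
    labels = np.full(len(S), -1)
    for _ in range(max_iter):
        # Squared Euclidean distances between all points and centroids, shape (N, k).
        sq = ((S[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
        new_labels = sq.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assumed termination criterion: assignment is stable
        labels = new_labels
        for j in range(len(Y)):
            members = S[labels == j]
            if len(members) > 0:  # assumption: empty clusters keep their centroid
                Y[j] = members.mean(axis=0)
    return Y, labels
```

Each pass costs kN distance evaluations in the assignment step, which is the quantity the accelerated variant later tries to reduce.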

3 The k-Means Algorithm for Graphs 

To extend k-means from the domain of feature vectors to the domain of graphs,
two modifications are necessary [13, 14]: First, we replace the Euclidean metric
by a graph metric. Second, we replace the sample mean of vectors by a related
concept for graphs.

3.1 Metric Graph Spaces 

In principle, we can substitute any graph metric into the standard k-means algo- 
rithm in order to obtain its structural counterpart. Here, we focus on geometric 
graph distances that are related to the Euclidean metric, because the Euclidean 
metric is the underlying metric of the vectorial mean. The vectorial mean in turn 
provides a link to deep results in probability theory and is the foundation for a 
rich repository of analytical tools in pattern recognition. To access at least parts 
of these results, it seems to be reasonable to relate graph metrics to the Euclidean 
metric. This restriction is acceptable from an application point of view, because 
geometric distance functions on graphs and their related similarity functions are 
a common choice of proximity measure [1, 3, 8, 10, 25, 27]. 

Though it is straightforward to define a graph metric, which is related to the 
Euclidean metric of vectors, we first make a detour via the concept of T-space 
in order to approach the sample mean of graphs in a principled way. 

Let $\mathbb{E}$ be a $d$-dimensional Euclidean vector space. An (attributed) graph is
a triple $X = (V, E, \alpha)$ consisting of a finite nonempty set $V$ of vertices, a set
$E \subseteq V \times V$ of edges, and an attribute function $\alpha : V \times V \to \mathbb{E}$, such that
$\alpha(i,j) \neq 0$ for each edge and $\alpha(i,j) = 0$ for each non-edge. Attributes $\alpha(i,i)$ of
vertices $i$ may take any value from $\mathbb{E}$.

For simplifying the mathematical treatment, we assume that all graphs are of 
order n, where n is chosen to be sufficiently large. Graphs of order less than n, say 
m < n, can be extended to order n by including isolated vertices with attribute 
zero. For practical issues, it is important to note that limiting the maximum order 
to some arbitrarily large number n and extending smaller graphs to graphs of 
order n are purely technical assumptions to simplify mathematics. For machine 
learning problems, these limitations should have no practical impact, because 
neither the bound n needs to be specified explicitly nor an extension of all 
graphs to an identical order needs to be performed. When applying the theory, 
all we actually require is that the graphs are finite. 

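A graph $X$ is completely specified by its matrix representation $\mathbf{X} = (x_{ij})$
with elements $x_{ij} = \alpha(i,j)$ for all $1 \leq i, j \leq n$. By concatenating the columns of
$\mathbf{X}$, we obtain a vector representation $x$ of $X$.

Let $\mathcal{X} = \mathbb{E}^{n \times n}$ be the Euclidean space of all $(n \times n)$-matrices and let $\mathcal{T}$ denote
a subset of the set $\mathcal{P}^n$ of all $(n \times n)$-permutation matrices. Two matrices $\mathbf{X} \in \mathcal{X}$
and $\mathbf{X}' \in \mathcal{X}$ are said to be equivalent if there is a permutation matrix $P \in \mathcal{T}$
such that $P^{\top} \mathbf{X} P = \mathbf{X}'$. The quotient set

$$\mathcal{X}_\mathcal{T} = \mathcal{X}/\mathcal{T} = \{[\mathbf{X}] : \mathbf{X} \in \mathcal{X}\}$$

is the $\mathcal{T}$-space over the representation space $\mathcal{X}$. A $\mathcal{T}$-space is a relaxation of the
set $\mathcal{G}_\mathcal{T} = \mathcal{G}/\mathcal{T}$ of all abstract graphs $[\mathbf{X}]$, where $\mathbf{X}$ is a matrix representation of
graph $X$.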



In the remainder of this contribution, we identify $\mathcal{X}$ with $\mathbb{E}^N$ ($N = n^2$)
and consider vector rather than matrix representations of abstract graphs. By
abuse of notation, we sometimes identify $X$ with $[x]$ and write $x \in X$ instead of
$x \in [x]$.

Finally, we equip a $\mathcal{T}$-space with a metric related to the Euclidean metric.
Suppose that $d(x, y) = \|x - y\|$ is a Euclidean metric on $\mathcal{X}$ induced by some
inner product. Then the distance function

$$D(X, Y) = \min \{d(x, y) : x \in X,\ y \in Y\}$$

is a metric with the same geometric properties as $d$. A pair $(x, y) \in X \times Y$ of
vector representations is called an optimal alignment if $D(X, Y) = d(x, y)$.

Calculating a Graph Metric. Here, we assume that $\mathcal{T}$ is equal to the set
$\mathcal{P}^n$ of all $(n \times n)$-permutation matrices. Determining a graph distance
$D(X, Y)$ and finding an optimal alignment of $X$ and $Y$ are equivalent problems
that are more generally referred to as a graph matching problem. In contrast
to calculating the Euclidean distance between vectors, computing $D(X, Y)$ is an
NP-complete problem [8]. Devising graph matching algorithms for computing
$D(X, Y)$ has become a mature field in structural pattern recognition that has
produced various powerful and efficient solutions to the graph matching problem
[2]. To extend k-means to the graph domain, any of those algorithms can be used.
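For very small graphs, the distance D and an optimal alignment can be obtained by exhaustive search over all vertex relabelings. The Python sketch below illustrates this under simplifying assumptions of ours: attributes are scalars, both graphs are given as (n x n) attribute matrices of the same order, and the function name `graph_distance` is hypothetical. Because the set of all permutations forms a group and the Euclidean norm is invariant under simultaneous row and column permutations, it suffices to relabel only one of the two graphs.

```python
from itertools import permutations

import numpy as np


def graph_distance(X, Y):
    """Brute-force graph distance between two (n x n) attribute matrices.

    Returns the distance and the relabeling of X that attains it.
    The search visits all n! permutations, so it is feasible for tiny n only.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    n = X.shape[0]
    best, best_Xp = np.inf, None
    for perm in permutations(range(n)):
        idx = list(perm)
        Xp = X[np.ix_(idx, idx)]  # relabel the vertices of X
        dist = np.linalg.norm(Xp - Y)  # Euclidean (Frobenius) distance
        if dist < best:
            best, best_Xp = dist, Xp
    return float(best), best_Xp
```

The factorial growth of the search space is the reason why approximate graph matching algorithms are used for all but the smallest graphs.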

3.2 The Sample Mean of Graphs 

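Given the metric space $(\mathcal{X}_\mathcal{T}, D)$ with $D(X, Y) = \min\{d(x, y) : x \in X,\ y \in Y\}$,
we introduce the sample mean of graphs and provide some results proved in [14].

Suppose that $\mathcal{S}_\mathcal{T} = (X_1, \ldots, X_N)$ is a sample of $N$ abstract graphs from
$\mathcal{G}_\mathcal{T} \subseteq \mathcal{X}_\mathcal{T}$. A sample mean of $\mathcal{S}_\mathcal{T}$ is any solution of the optimization problem

$$(P) \qquad \min\ F(X) = \sum_{i=1}^{N} D(X, X_i)^2 \quad \text{s.t. } X \in \mathcal{X}_\mathcal{T}.$$

The cost function $F$ is the sum of squared distances (SSD) to the sample graphs.
Here, the problem is to find a solution from an uncountably infinite set $\mathcal{X}_\mathcal{T}$. A
simpler problem is to restrict the set $\mathcal{X}_\mathcal{T}$ of feasible solutions to the finite sample
$\mathcal{S}_\mathcal{T} \subseteq \mathcal{X}_\mathcal{T}$. A set mean graph of $\mathcal{S}_\mathcal{T}$ is defined by

$$Y = \arg\min \{F(X) : X \in \mathcal{S}_\mathcal{T}\}.$$

We summarize the most important results from [14] for deriving subgradient-
based algorithms for solving problem (P).

Theorem 1. Let $\mathcal{S}_\mathcal{T} = (X_1, \ldots, X_N) \subseteq \mathcal{G}_\mathcal{T}$ be a sample of $N$ abstract graphs.

1. Problem (P) has a solution. The solutions are abstract graphs from $\mathcal{G}_\mathcal{T}$.

2. The SSD function $F$ is locally Lipschitz.

3. A vector representation $y$ of a sample mean $Y \in \mathcal{X}_\mathcal{T}$ of $\mathcal{S}_\mathcal{T}$ is of the form

$$y = \frac{1}{N} \sum_{i=1}^{N} x_i,$$

where $x_i \in X_i$ with $d(x_i, y) = D(X_i, Y)$ for all $i \in \{1, \ldots, N\}$. We call the vector
representations $(x_1, \ldots, x_N)$ an optimal multiple alignment of $\mathcal{S}_\mathcal{T}$.

4. Let $(x_1, \ldots, x_N)$ be an optimal multiple alignment of $\mathcal{S}_\mathcal{T}$. Then

$$\sum_{i=1}^{N} \sum_{j=i+1}^{N} \langle x_i, x_j \rangle \;\geq\; \sum_{i=1}^{N} \sum_{j=i+1}^{N} \langle x'_i, x'_j \rangle$$

for all vector representations $x'_1 \in X_1, \ldots, x'_N \in X_N$.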

The first statement ensures that problem (P) can be solved and has feasible solu- 
tions. Since the SSD satisfies the locally Lipschitz condition according to the sec- 
ond statement, we can apply generalized gradient techniques from nonsmooth op- 
timization for minimizing the SSD [19]. The third statement shows that a vector 
representation of a structural sample mean is the standard sample mean of cer- 
tain vector representations of the sample graphs. In addition, we see that problem 
(P) is a discrete rather than a continuous optimization problem, where a solution
can be chosen from the finite set $X_1 \times \cdots \times X_N = \{(x_1, \ldots, x_N) : x_i \in X_i\}$.
The latter property combined with the fourth statement can be exploited for 
constructing search algorithms or meta-heuristics like genetic algorithms. The 
fourth statement asks for maximizing the sum of pairwise similarities (SPS). The 
standard sample mean of a vector representation maximizing the SPS is a vector 
representation of a structural sample mean. Apart from this, the fourth property 
provides a geometric characterization stating that an optimal multiple alignment 
has minimal volume within the subspace spanned by the vector representations. 
In the case that D is derived from the maximum common subgraph problem, 
the fourth property says that an optimal multiple alignment maximizes the sum 
of common edges of the sample graphs. This in turn indicates that computation 
of the sample mean has potential applications in frequent substructure mining. 

A Subgradient Method for Approximating a Sample Mean. So far, we
have defined a concept of sample mean for graphs. For practical applications,
we need an efficient procedure to solve problem (P) in order to recompute
the centroids of the k-means algorithm for graphs. For this, we assume that
$\mathcal{S}_\mathcal{T} = (X_1, \ldots, X_N)$ is a sample of $N$ graphs.

Generic Subgradient Method. Suppose that we want to minimize a locally
Lipschitz function $f$ on $\mathcal{X}$. Then $f$ admits a generalized gradient at each point. The
generalized gradient coincides with the gradient at differentiable points and is a
convex set of points, called subgradients, at non-differentiable points. The basic
idea of subgradient methods is to generalize the methods for smooth problems
by replacing the gradient by an arbitrary subgradient. Algorithm 2 outlines the
basic procedure of a generic subgradient method.

Algorithm 2 (Generic Subgradient Method)

01 set $t := 0$ and choose starting point $x^0 \in \mathcal{X}$
02 repeat
03     Direction finding:
04         determine $d \in \mathcal{X}$ and $\eta > 0$ such that $f(x^t + \eta d) < f(x^t)$
05     Line search:
06         find step size $\eta_* > 0$ such that $\eta_* \approx \arg\min_{\eta > 0} f(x^t + \eta d)$
07     Updating:
08         set $x^{t+1} := x^t + \eta_* d$
09         set $t := t + 1$
10 until some termination criterion is satisfied

At differentiable points, direction finding generates a descent direction $d$ by
exploiting the fact that the direction opposite to the gradient of $f$ is locally
the steepest descent direction. At non-differentiable points, direction finding
amounts to generating an arbitrary subgradient. The problem is that a
subgradient at a non-differentiable point is not necessarily a direction of descent.
But according to Rademacher's Theorem, the set of non-differentiable points is
a set of Lebesgue measure zero. Line search determines a step size $\eta_* > 0$ with
which the current solution $x^t$ is moved along direction $d$ in the updating step.
Subgradient methods use predetermined step sizes $\eta_t$ instead of some efficient
univariate smooth optimization method or polynomial interpolation as in
gradient descent methods. One reason for this is that a subgradient determined
in the direction finding step is not necessarily a direction of descent. Thus, the
viability of subgradient methods depends critically on the sequence of step sizes.
Updating moves the current solution $x^t$ to the next solution $x^{t+1} = x^t + \eta_t d$. Since the
subgradient method is not a descent method, it is common to keep track of the
best point found so far, i.e., the one with smallest function value. For more
details on subgradient methods and more advanced techniques to minimize locally
Lipschitz functions, we refer to [19].
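The role of predetermined step sizes and of tracking the best point can be illustrated on a simple nonsmooth function. The Python sketch below is our own toy example and not taken from [19]: it minimizes the hypothetical test function f(x) = ||x - a||_1 with the diminishing step sizes 1/(t+1), and the names `subgradient_method`, `f`, and `subgrad` are assumptions.

```python
import numpy as np


def subgradient_method(f, subgrad, x0, steps):
    """Subgradient iteration with predetermined step sizes.

    Keeps the best point found so far, because a single step
    may increase the function value.
    """
    x = np.asarray(x0, dtype=float)
    best_x, best_f = x.copy(), f(x)
    for eta in steps:
        x = x - eta * subgrad(x)  # move against an arbitrary subgradient
        fx = f(x)
        if fx < best_f:
            best_x, best_f = x.copy(), fx
    return best_x, best_f


# Toy problem: f is locally Lipschitz and non-differentiable where x_i = a_i.
a = np.array([1.0, -2.0])
f = lambda x: float(np.abs(x - a).sum())
subgrad = lambda x: np.sign(x - a)  # one valid subgradient of f at x
steps = [1.0 / (t + 1) for t in range(200)]
x_best, f_best = subgradient_method(f, subgrad, np.zeros(2), steps)
```

In this example the iterates oscillate around the minimizer with an amplitude bounded by the current step size, which is why the quality of the result depends on the step size sequence.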

Several different subgradient methods for approximating a sample mean have 
been suggested [15]. For extending k-means to the domain of graphs, we have 
chosen the incremental arithmetic mean (IAM) method. In an empirical compar- 
ison of 8 different subgradient methods [15], IAM performed best with respect 
to computation time and was ranked third with respect to solution quality. In
addition, IAM offered the best trade-off between computation time and solution
quality. For this reason, we consider IAM a good candidate for recomputing the
centroids of the k-means clusters.



IAM - Incremental Arithmetic Mean. The elementary incremental subgradient
method randomly chooses a sample graph $X_t$ from $\mathcal{S}_\mathcal{T}$ at each iteration $t$ and
updates the estimate $y^t \in Y^t$ of the vector representation of a sample mean
according to the formula

$$y^{t+1} = y^t - \eta^t \, (y^t - x^t),$$

where $\eta^t$ is the step size and $(x^t, y^t)$ is an optimal alignment.

As a special case of the incremental subgradient algorithm, the incremental
arithmetic mean method emulates the incremental calculation of the standard
sample mean. First the order of the sample graphs from $\mathcal{S}_\mathcal{T}$ is randomly
permuted. Then a sample mean is estimated according to the formula

$$y^1 = x_1, \qquad y^i = \frac{i-1}{i}\, y^{i-1} + \frac{1}{i}\, x_i \quad \text{for } 1 < i \leq N,$$

where $(x_i, y^{i-1})$ are optimal alignments for all $1 < i \leq N$. The graph $Y$
represented by the vector $y^N$ is an approximation of a sample mean of $\mathcal{S}_\mathcal{T}$. In
general, $Y$ is not an optimal solution of problem (P). This procedure is inspired
by Theorem 1.3 and requires only one iteration through the sample. The IAM
method requires $N - 1$ distance calculations for approximating a sample mean.
The solution quality depends on the order of selecting the sample graphs from
$\mathcal{S}_\mathcal{T}$.
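The IAM recursion can be sketched as follows in Python. As before, this is an illustration under assumptions of ours: graphs are small (n x n) matrices with scalar attributes, optimal alignments are found by brute force in the hypothetical helper `align`, and the function name `incremental_arithmetic_mean` is not from the original work.

```python
from itertools import permutations

import numpy as np


def align(X, Y):
    """Relabeling of X that is closest to Y (brute force, small graphs only)."""
    n = X.shape[0]
    relabelings = (X[np.ix_(p, p)] for p in map(list, permutations(range(n))))
    return min(relabelings, key=lambda Xp: np.linalg.norm(Xp - Y))


def incremental_arithmetic_mean(sample, rng=None):
    """One pass through the randomly permuted sample, N - 1 alignments in total."""
    rng = np.random.default_rng() if rng is None else rng
    order = rng.permutation(len(sample))
    mean = np.asarray(sample[order[0]], dtype=float)  # y^1 = x_1
    for i, idx in enumerate(order[1:], start=2):
        # Align the next sample graph to the current estimate y^(i-1).
        x = align(np.asarray(sample[idx], dtype=float), mean)
        mean = (i - 1) / i * mean + x / i
    return mean
```

Each call to `align` corresponds to one graph distance calculation, so the loop performs exactly N - 1 of them.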



3.3 The k-Means Algorithm for Graphs 

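Having a graph metric and a concept of sample mean, we are now in a position
to extend the k-means algorithm to structure spaces $\mathcal{X}_\mathcal{T}$ over some Euclidean
space $\mathcal{X}$. We assume that $D$ is a distance metric induced by a Euclidean metric
on $\mathcal{X}$. Now suppose that $\mathcal{S}_\mathcal{T} = \{X_1, \ldots, X_N\}$ is a training sample of $N$ graphs
drawn from $\mathcal{X}_\mathcal{T}$. We replace the standard cluster objective $J$ by

$$J_\mathcal{T}(M, \mathcal{Y}_\mathcal{T} \mid \mathcal{S}_\mathcal{T}) = \sum_{j=1}^{k} \sum_{i=1}^{N} m_{ij} \, D(X_i, Y_j)^2,$$

where $\mathcal{Y}_\mathcal{T} = \{Y_1, \ldots, Y_k\}$ is a set of $k$ centroids from $\mathcal{X}_\mathcal{T}$ and $M = (m_{ij})$ is a
membership matrix defining a partition of the set $\mathcal{S}_\mathcal{T}$.

Given a membership matrix $M$, the cluster objective $J_\mathcal{T}(\cdot \mid M, \mathcal{S}_\mathcal{T})$ is no
longer differentiable as a function of the centroids $\mathcal{Y}_\mathcal{T}$. But as shown in [14, 15],
the objective $J_\mathcal{T}(\cdot \mid M, \mathcal{S}_\mathcal{T})$ is locally Lipschitz and therefore differentiable as a
function of $\mathcal{Y}_\mathcal{T}$ for almost all graphs. The $k$ centroids that minimize the cluster
objective $J_\mathcal{T}(\cdot \mid M, \mathcal{S}_\mathcal{T})$ are the structural versions of the sample mean

$$Y_j = \arg\min_{Y \in \mathcal{X}_\mathcal{T}} F(Y), \qquad F(Y) = \sum_{i=1}^{N} m_{ij} \, D(X_i, Y)^2.$$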



Algorithm 3 (K-Means Algorithm for Structure Spaces)

01 choose initial centroids $Y_1, \ldots, Y_k \in \mathcal{X}_\mathcal{T}$
02 repeat
03     assign each $X \in \mathcal{S}_\mathcal{T}$ to its closest centroid $Y_X = \arg\min_{Y \in \mathcal{Y}_\mathcal{T}} D(X, Y)^2$
04     recompute each $Y \in \mathcal{Y}_\mathcal{T}$ as a sample mean of all graphs from $C(Y)$
05 until some termination criterion is satisfied



Hence, we can easily extend Algorithm 1 to minimize the cluster objective $J_\mathcal{T}$.
Algorithm 3 describes the basic procedure of the k-means algorithm for structure
spaces independent of the particular choice of method to minimize the objective
$F$ of a sample mean. As for vectors, $C(Y)$ denotes the cluster associated
with centroid $Y \in \mathcal{Y}_\mathcal{T}$.

Each iteration of the structural version of k-means requires $kN$ distance
calculations to assign each pattern graph to a centroid and at least $O(N)$
additional distance calculations for recomputing the centroids using the incremental
arithmetic mean subgradient method. This gives a total of at least $O(kN + N)$
distance calculations in each iteration of Algorithm 3.
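Putting the pieces together, the structural k-means loop can be sketched in Python as shown below. The sketch rests on assumptions of ours: graphs are small (n x n) NumPy matrices with scalar attributes, alignments are computed by brute force, centroids are recomputed by an incremental arithmetic mean in a fixed (not randomly permuted) order, and iteration stops when the assignment is stable. All function names are hypothetical.

```python
from itertools import permutations

import numpy as np


def align(X, Y):
    """Relabeling of X that is closest to Y (brute force over all permutations)."""
    n = X.shape[0]
    relabelings = (X[np.ix_(p, p)] for p in map(list, permutations(range(n))))
    return min(relabelings, key=lambda Xp: np.linalg.norm(Xp - Y))


def dist(X, Y):
    """Graph distance between X and Y via an optimal alignment."""
    return float(np.linalg.norm(align(X, Y) - Y))


def iam(graphs):
    """Incremental arithmetic mean of a non-empty list of graphs (fixed order)."""
    mean = graphs[0].astype(float)
    for i, X in enumerate(graphs[1:], start=2):
        mean = (i - 1) / i * mean + align(X, mean) / i
    return mean


def kmeans_graphs(sample, centroids, max_iter=20):
    centroids = [np.asarray(c, dtype=float) for c in centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: k distance calculations per pattern graph.
        new_labels = [min(range(len(centroids)), key=lambda j: dist(X, centroids[j]))
                      for X in sample]
        if new_labels == labels:
            break  # assumed termination criterion: assignment is stable
        labels = new_labels
        # Update step: approximate a sample mean of each non-empty cluster.
        for j in range(len(centroids)):
            members = [X for X, lab in zip(sample, labels) if lab == j]
            if members:
                centroids[j] = iam(members)
    return centroids, labels
```

Every call to `dist` or `align` stands for one graph matching problem, which makes the kN + N count per iteration directly visible in the code.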

4 Elkan's k-Means for Graphs 

In this section we extend Elkan's k-means [5] from vectors to graphs. 

Frequent evaluation of NP-hard graph distances dominates the computa- 
tional cost of k-means for graphs. Accelerating k-means therefore aims at re- 
ducing the number of graph distance calculations. In [5], Elkan suggested an 
accelerated formulation of the standard k-means algorithm for vectors exploit- 
ing the triangle inequality of the underlying distance metric. Since the distance 
function $D$ on $\mathcal{X}_\mathcal{T}$ induced by a Euclidean metric is also a metric [16], we can
transfer Elkan's k-means acceleration from Euclidean spaces to $\mathcal{T}$-spaces.

To extend Elkan's k-means acceleration to $\mathcal{T}$-spaces, we assume that $X \in \mathcal{S}_\mathcal{T}$
is a pattern graph and $Y, Y' \in \mathcal{Y}_\mathcal{T}$ are centroids. As before, by $Y_X$ we denote
the centroid the pattern graph $X$ is assigned to. Elkan's acceleration is based on
two observations:

1. From the triangle inequality of a metric follows

$$u(X) \leq \tfrac{1}{2} D(Y_X, Y) \;\Rightarrow\; D(X, Y_X) \leq D(X, Y), \qquad (1)$$

where $u(X) \geq D(X, Y_X)$ denotes an upper bound of the distance $D(X, Y_X)$.

2. We have

$$u(X) \leq l(X, Y) \;\Rightarrow\; D(X, Y_X) \leq D(X, Y), \qquad (2)$$

where $l(X, Y) \leq D(X, Y)$ denotes a lower bound of the distance $D(X, Y)$.
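For completeness, both implications can be verified in one line each. The following derivation is our own spelled-out version of the standard argument; it uses only the triangle inequality of $D$ and the defining properties of the bounds $u$ and $l$.

```latex
% Implication (1): assume u(X) \le \tfrac{1}{2} D(Y_X, Y).
% The triangle inequality gives D(Y_X, Y) \le D(Y_X, X) + D(X, Y), hence
\begin{align*}
D(X, Y) &\ge D(Y_X, Y) - D(X, Y_X) \\
        &\ge 2\,u(X) - u(X) = u(X) \ge D(X, Y_X).
\end{align*}
% Implication (2): assume u(X) \le l(X, Y). Then
\begin{align*}
D(X, Y_X) \le u(X) \le l(X, Y) \le D(X, Y).
\end{align*}
```

In both cases the distance $D(X, Y)$ never has to be evaluated to conclude that $Y$ is not closer to $X$ than $Y_X$.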



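Algorithm 4 (Elkan's k-Means Algorithm for Structure Spaces)

01 choose set $\mathcal{Y}_\mathcal{T} = \{Y_1, \ldots, Y_k\}$ of initial centroids
02 set $l(X, Y) = 0$ for all $X \in \mathcal{S}_\mathcal{T}$ and for all $Y \in \mathcal{Y}_\mathcal{T}$
03 set $u(X) = \infty$ for all $X \in \mathcal{S}_\mathcal{T}$
04 randomly assign each $X \in \mathcal{S}_\mathcal{T}$ to a centroid $Y_X \in \mathcal{Y}_\mathcal{T}$
05 repeat
06     compute $D(Y, Y')$ for all centroids $Y, Y' \in \mathcal{Y}_\mathcal{T}$
07     for each $X \in \mathcal{S}_\mathcal{T}$ and $Y \in \mathcal{Y}_\mathcal{T}$ do
08         if $Y$ is a candidate centroid for $X$
09             if $u(X)$ is out-of-date
10                 update $u(X) = D(X, Y_X)$
11                 update $l(X, Y_X) = u(X)$
12             if $Y$ is a candidate centroid for $X$
13                 update $l(X, Y) = D(X, Y)$
14                 if $l(X, Y) < u(X)$
15                     update $u(X) = l(X, Y)$
16                     replace $Y_X = Y$
17     recompute mean $\tilde{Y}$ of cluster $C(Y)$ for all $Y \in \mathcal{Y}_\mathcal{T}$
18     compute $\delta(Y) = D(Y, \tilde{Y})$ for all $Y \in \mathcal{Y}_\mathcal{T}$
19     set $u(X) = u(X) + \delta(Y_X)$ for all $X \in \mathcal{S}_\mathcal{T}$
20     set $l(X, Y) = \max\{l(X, Y) - \delta(Y),\, 0\}$ for all $X \in \mathcal{S}_\mathcal{T}$ and for all $Y \in \mathcal{Y}_\mathcal{T}$
21     replace $Y$ by $\tilde{Y}$ for all $Y \in \mathcal{Y}_\mathcal{T}$
22 until some termination criterion is satisfied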



Remark:

1. Setting the value of a variable such as h(Y, Y') and u(X) implicitly declares the
value of that variable as out-of-date. Updating those variables declares the value
as up-to-date.

2. The conditions in lines 08 and 12 are redundant only if the upper bound u(X) is
up-to-date.



As an immediate consequence, we can safely avoid calculating a distance $D(X, Y)$
between a pattern graph $X$ and an arbitrary centroid $Y$ if at least one of the
following conditions is satisfied:

(C1) $Y = Y_X$

(C2) $u(X) \leq \tfrac{1}{2} D(Y_X, Y)$

(C3) $u(X) \leq l(X, Y)$

We say that $Y$ is a candidate centroid for $X$ if all conditions (C1)-(C3) are violated.
Conversely, if $Y \neq Y_X$ is not a candidate centroid for $X$, then condition (C2)
or condition (C3) is satisfied. From the inequalities (1) and (2) it follows that $Y$
cannot be closer to $X$ than $Y_X$. Therefore, it is not necessary to calculate
the distance $D(X, Y)$. If no centroid is a candidate centroid for $X$, all distance
calculations $D(X, Y)$ with $Y \in \mathcal{Y}_\mathcal{T}$ can be skipped and $X$ must
remain assigned to $Y_X$.
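The candidate test amounts to three cheap comparisons on stored values. A minimal Python sketch follows; the function name `is_candidate` and its argument names are hypothetical, centroids are identified by integer indices, and `d_centers` stands for the already computed distance between the assigned centroid and the centroid under test.

```python
def is_candidate(y, y_x, u_x, l_xy, d_centers):
    """Return True if all of the conditions (C1)-(C3) are violated.

    y, y_x    : indices of the tested centroid and of the assigned centroid
    u_x       : upper bound on the distance from X to its assigned centroid
    l_xy      : lower bound on the distance from X to the tested centroid
    d_centers : distance between the assigned and the tested centroid
    """
    if y == y_x:  # (C1): the tested centroid is the assigned one
        return False
    if u_x <= 0.5 * d_centers:  # (C2): pruned by the centroid-centroid distance
        return False
    if u_x <= l_xy:  # (C3): pruned by the lower bound
        return False
    return True
```

Only when this test returns True does the algorithm fall back to evaluating a graph distance.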



Now suppose that $Y \neq Y_X$ is a candidate centroid for $X$. Then we apply the
technique of "delayed (distance) evaluation". We first test whether the upper
bound $u(X)$ is out-of-date, i.e., whether $u(X) \neq D(X, Y_X)$. If $u(X)$ is out-of-date, we
improve the upper bound by setting $u(X) = D(X, Y_X)$. Since improving $u(X)$
might eliminate $Y$ as a candidate centroid for $X$, we again check conditions
(C2) and (C3). If both conditions are still violated despite the updated upper
bound $u(X)$, we have the following situation:

$$u(X) = D(X, Y_X) > \tfrac{1}{2} D(Y_X, Y) \quad \text{and} \quad u(X) = D(X, Y_X) > l(X, Y).$$

Since the distances on the left and right hand side of the inequality of condition
(C2) are known, we may conclude that the situation for condition (C2) cannot
be altered. Therefore, we re-examine condition (C3) by calculating the distance
$D(X, Y)$ and updating the lower bound $l(X, Y) = D(X, Y)$. If condition (C3) is
still violated, we have

$$u(X) = D(X, Y_X) > D(X, Y) = l(X, Y).$$

This implies that $X$ is closer to centroid $Y$ than to $Y_X$ and therefore has to be
assigned to centroid $Y$.

Crucial for avoiding distance calculations are good estimates of the lower
and upper bounds $l(X, Y)$ and $u(X)$ in each iteration. For this, we compute the
change $\delta(Y)$ of each centroid $Y$ by the distance

$$\delta(Y) = D(Y, \tilde{Y}),$$

where $\tilde{Y}$ is the recomputed centroid of cluster $C(Y)$. Based on the triangle
inequality, we set the bounds according to the following rules:

$$l(X, Y) = \max\{l(X, Y) - \delta(Y),\, 0\} \qquad (3)$$

$$u(X) = u(X) + \delta(Y_X). \qquad (4)$$

In addition, $u(X)$ is then declared as out-of-date (see footnote 1). Both rules guarantee that
$l(X, Y)$ is always a lower bound of $D(X, Y)$ and $u(X)$ is always an upper bound
of $D(X, Y_X)$.
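That rules (3) and (4) preserve valid bounds can be checked numerically. The Python sketch below does so for Euclidean vectors, which by the remark in the introduction correspond to graphs with a single attributed node; the function name `update_bounds`, the random test data, and the shift magnitude 0.3 are our assumptions.

```python
import numpy as np


def update_bounds(u, lower, delta, assigned):
    """Apply rules (3) and (4) for one pattern after the centroids have moved.

    u        : upper bound on the distance to the assigned centroid
    lower    : array of lower bounds, one entry per centroid
    delta    : array with the shift of each centroid
    assigned : index of the assigned centroid
    """
    new_lower = np.maximum(lower - delta, 0.0)  # rule (3)
    new_u = u + delta[assigned]  # rule (4)
    return new_u, new_lower


rng = np.random.default_rng(0)
dist = lambda a, b: float(np.linalg.norm(a - b))

x = rng.normal(size=2)
centroids = rng.normal(size=(3, 2))
assigned = 0
u = dist(x, centroids[assigned])  # tight upper bound
lower = np.array([dist(x, c) for c in centroids])  # tight lower bounds

# Move every centroid and record how far it moved.
new_centroids = centroids + 0.3 * rng.normal(size=centroids.shape)
delta = np.array([dist(c, c_new) for c, c_new in zip(centroids, new_centroids)])

u, lower = update_bounds(u, lower, delta, assigned)
true = np.array([dist(x, c) for c in new_centroids])
bounds_ok = bool(np.all(lower <= true + 1e-12) and u >= true[assigned] - 1e-12)
```

The check never evaluates a distance between the pattern and a new centroid to update the bounds; only the k centroid shifts are needed.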

Algorithm 4 presents a detailed description of Elkan's k-means algorithm for
graphs. During each iteration, $k(k-1)/2$ pairwise distances between all centroids
must be recomputed (Algorithm 4, line 06). Recomputing the centroids using
incremental arithmetic mean (see Section 3.2) requires additional $O(N)$ distance
calculations (Algorithm 4, line 17). To update the lower and upper bounds, $k$
distances between the current and the new centroids must be calculated
(Algorithm 4, lines 18-21). This gives a minimum of $O(N + k^2)$ distance calculations
at each iteration ignoring the delayed distance evaluations in lines 07-16 of
Algorithm 4. As the centroids converge, one would expect that the partition of the
training sample becomes more and more stable, which results in a decreasing
number of delayed distance evaluations.

1 In the original formulation of Elkan's algorithm for feature vectors, the upper bounds
$u(X)$ are declared as out-of-date regardless of the value $\delta(Y)$.



data set      #(graphs)  #(classes)  avg(nodes)  max(nodes)  avg(edges)  max(edges)
letter            750         15         4.7          8          3.1          6
grec              528         22        11.5         24         11.9         29
fingerprint       900          3         8.3         26         14.1         48
molecules         100          2        24.6         40         25.2         44

Table 1. Summary of main characteristics of the data sets.



5 Experiments 

This section reports the results of running k-means and Elkan's k-means on four 
graph data sets. 

5.1 Data

We selected four data sets described in [22]. The data sets are publicly available 
at [11]. Each data set is divided into a training, validation, and a test set. In 
all four cases, we considered data from the test set only. The descriptions of the
data sets are mainly excerpts from [22]. Table 1 provides a summary of the main
characteristics of the data sets. 

Letter Graphs. We consider all 750 graphs from the test data set representing 
distorted letter drawings from the Roman alphabet that consist of straight lines 
only (A, E, F, H, I, K, L, M, N, T, V, W, X, Y, Z). The graphs are uniformly 
distributed over the 15 classes (letters). The letter drawings are obtained by dis- 
torting prototype letters at low distortion level. Lines of a letter are represented 
by edges and ending points of lines by vertices. Each vertex is labeled with a 
two-dimensional vector giving the position of its end point relative to a reference 
coordinate system. Edges are labeled with weight 1. Figure 1 shows a prototype
letter and distorted versions at various distortion levels.



Fig. 1. Example of letter drawings: Prototype of letter A and distorted copies generated 
by imposing low, medium, and high distortion (from left to right) on prototype A. 



GREC Graphs. The GREC data set [4] consists of graphs representing symbols
from architectural and electronic drawings. We use all 528 graphs from the test
data set uniformly distributed over 22 classes. The images occur at five different
distortion levels. Figure 2 shows one example drawing for each distortion
level. Depending on the distortion level, either erosion, dilation, or other
morphological operations are applied. The result is thinned to obtain lines of
one pixel width. Finally, graphs are extracted from the resulting denoised
images by tracing the lines from end to end and detecting intersections as well
as corners. Ending points, corners, intersections and circles are represented by
vertices and labeled with a two-dimensional attribute giving their position. The
vertices are connected by undirected edges which are labeled as line or arc. An
additional attribute specifies the angle with respect to the horizontal direction
or the diameter in case of arcs.



Fingerprint Graphs. We consider a subset of 900 graphs from the test data set 
representing fingerprint images of the NIST-4 database [26]. The graphs are
uniformly distributed over three classes left, right, and whorl. A fourth class (arch)
is excluded in order to keep the data set balanced. Fingerprint images are con- 
verted into graphs by filtering the images and extracting regions that are relevant 
[21]. Relevant regions are binarized and a noise removal and thinning procedure 
is applied. This results in a skeletonized representation of the extracted regions. 
Ending points and bifurcation points of the skeletonized regions are represented 
by vertices. Additional vertices are inserted in regular intervals between ending 
points and bifurcation points. Finally, undirected edges are inserted to link ver- 
tices that are directly connected through a ridge in the skeleton. Each vertex is 
labeled with a two-dimensional attribute giving its position. Edges are attributed 
with an angle denoting the orientation of the edge with respect to the horizontal 
direction. Figure 3 shows fingerprints of each class. 



Fig. 3. Fingerprints: (a) Left (b) Right (c) Arch (d) Whorl. Fingerprints of class arch 
are not considered. 







Fig. 2. GREC symbols: A sample image of each distortion level 




Molecules. The mutagenicity data set consists of chemical molecules from two
classes (mutagen, non-mutagen). The data set was originally compiled by [17]
and reprocessed by [22]. We consider a subset of 100 molecules from the test data
set uniformly distributed over both classes. We describe molecules by graphs in
the usual way: atoms are represented by vertices labeled with the atom type
of the corresponding atom and bonds between atoms are represented by edges
labeled with the valence of the corresponding bonds. We used a 1-to-k binary
encoding for representing atom types and valence of bonds, respectively.
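
As an illustration, a 1-to-k binary encoding of a categorical label can be
sketched as follows. This is a minimal sketch only: the label sets ATOM_TYPES
and VALENCES are hypothetical placeholders and not the actual label sets of the
mutagenicity data, and the function name one_of_k is introduced here for
illustration.

```python
def one_of_k(value, categories):
    """Return a binary vector with a single 1 at the position of `value`."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if c == value else 0 for c in categories]


# Hypothetical label sets, chosen only for this example.
ATOM_TYPES = ["C", "N", "O", "H"]
VALENCES = [1, 2, 3]

vertex_label = one_of_k("N", ATOM_TYPES)  # attribute of a nitrogen vertex
edge_label = one_of_k(2, VALENCES)        # attribute of a double bond
print(vertex_label, edge_label)
```

With such an encoding, every vertex and edge attribute is a vector, so that
attribute distances can be computed in the same way as for the other data sets.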

5.2 General Experimental Setup 

In all experiments, we applied standard k-means for graphs (std) and Elkan's 
k-means for graphs (elk) to the aforementioned data sets using the following 
experimental setup: 

Setting of k-means algorithms. To initialize the k-means algorithms, we used a
modified version of the "furthest first" heuristic [9]. For each data set S, the
first centroid $Y_1$ is initialized to be a graph closest to the sample mean of S.
Subsequent centroids are initialized according to

$$Y_{i+1} = \arg\max_{X \in S} \min_{Y \in \mathcal{Y}_i} D(X, Y),$$

where $\mathcal{Y}_i$ is the set of the first i centroids chosen so far. We
terminated each k-means algorithm after 3 iterations without improvement of the
cluster objective J.
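
The initialization rule can be sketched in a few lines of Python. This is a
minimal sketch under stated assumptions, not the implementation used in the
experiments: dist is a stand-in for the graph distance D, first_index is
assumed to be the index of the pattern closest to the sample mean (computing
that mean is outside the scope of the sketch), and the toy example uses real
numbers instead of graphs.

```python
def furthest_first(sample, dist, first_index, k):
    """Return the indices of k initial centroids chosen by furthest-first.

    Starting from sample[first_index], repeatedly add the pattern whose
    distance to its nearest already chosen centroid is largest.
    """
    if not 1 <= k <= len(sample):
        raise ValueError("k must be between 1 and the sample size")
    chosen = [first_index]
    # min_dist[j]: distance from sample[j] to its nearest chosen centroid
    min_dist = [dist(x, sample[first_index]) for x in sample]
    while len(chosen) < k:
        nxt = max(range(len(sample)), key=lambda j: min_dist[j])
        chosen.append(nxt)
        for j, x in enumerate(sample):
            min_dist[j] = min(min_dist[j], dist(x, sample[nxt]))
    return chosen


# Toy example on the real line (assumption: absolute difference as distance).
points = [0.0, 1.0, 5.0, 9.0, 10.0]
init = furthest_first(points, lambda a, b: abs(a - b), first_index=2, k=3)
print(init)
```

Keeping the running minimum min_dist means that each newly chosen centroid
costs one distance calculation per pattern, rather than one per pattern and
per previously chosen centroid.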

Graph distance calculations and optimal alignment. For graph distance
calculations and finding optimal alignments, we applied a depth first search
algorithm on the letter data set and the graduated assignment [8] on the grec,
fingerprint, and molecule data sets. The depth first search method is guaranteed
to return optimal solutions, but its computational cost restricts it to small
graphs. Graduated assignment returns approximate solutions.

Performance measures. We used the following measures to assess the performance
of an algorithm on a data set: (1) error (value of the cluster objective J),
(2) classification accuracy, (3) silhouette index, and (4) number of graph
distance calculations.

The silhouette index is a cluster validation index taking values from [-1, 1].
Higher values indicate a more compact and well separated cluster structure. For
more details we refer to Appendix A and [24]. Elkan's k-means and graph-vector
reduction k-means incur computational overhead to create and update auxiliary
data structures and to compute Euclidean distances. This overhead is negligible
compared to the time spent on graph distance calculations. Therefore, we report
the number of graph distance calculations rather than clock times as a
performance measure for speed.
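
One simple way to obtain such a count is to wrap the distance function in a
counter. The following is a minimal sketch; the class name CountingDistance is
hypothetical, and the absolute difference on real numbers stands in for a graph
distance.

```python
class CountingDistance:
    """Wrap a distance function and count how often it is evaluated."""

    def __init__(self, dist):
        self.dist = dist
        self.calls = 0

    def __call__(self, x, y):
        self.calls += 1
        return self.dist(x, y)


# Toy example: assign 5 points to the nearest of 2 centroids.
counted = CountingDistance(lambda a, b: abs(a - b))
centroids = [0.0, 10.0]
points = [0.0, 1.0, 2.0, 9.0, 11.0]
labels = [min(range(len(centroids)), key=lambda j: counted(x, centroids[j]))
          for x in points]
print(labels, counted.calls)
```

A plain assignment step evaluates the distance once per pattern and centroid,
so the counter reads 5 x 2 = 10 in this example.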

5.3 Performance Comparison 

We applied standard k-means (std) and Elkan's k-means (elk) to all four data sets 
in order to assess and compare their performance. The number k of centroids 



data set       k   measure                  std      elk

letter         30  error                   11.6     11.5
                   accuracy                0.86     0.86
                   silhouette              0.38     0.39
                   iterations                21       13
                   matchings (x10^3)
                     per iteration         23.2      3.3
                     total                488.4     42.5
                   speedup
                     per iteration          1.0      7.1
                     total                  1.0     11.5

grec           33  error                   32.7     32.2
                   accuracy                0.84     0.83
                   silhouette              0.40     0.44
                   iterations                11       11
                   matchings (x10^3)
                     per iteration         18.0      5.7
                     total                197.5     63.1
                   speedup
                     per iteration          1.0      3.1
                     total                  1.0      3.1

fingerprint    60  error                   1.88     1.70
                   accuracy                0.81     0.82
                   silhouette              0.32     0.31
                   iterations                10       11
                   matchings (x10^3)
                     per iteration         54.9      4.8
                     total                  549     52.4
                   speedup
                     per iteration          1.0     11.5
                     total                  1.0     10.5

molecules      10  error                   27.6     27.2
                   accuracy                0.69     0.70
                   silhouette              0.03     0.04
                   iterations                13       13
                   matchings (x10^3)
                     per iteration          1.1      1.1
                     total                 14.3     14.5
                   speedup
                     per iteration          1.0     0.94
                     total                  1.0     0.94



Table 2. Results of different k-means clusterings on four data sets. Columns labeled
with std and elk give the performance of standard k-means for graphs and Elkan's
k-means for graphs, respectively. Rows labeled matchings give the number of
distance calculations (x10^3), and rows labeled speedup show how many times an
algorithm is faster than standard k-means for graphs.



as shown in Table 2 was chosen as a compromise between a satisfactory
classification accuracy and the silhouette index. For each data set, 5 runs of
each algorithm were performed and the best cluster result was selected.

Table 2 summarizes the results. The first observation to be made is that the
solution quality of std and elk is comparable with respect to error,
classification accuracy, and silhouette index. Deviations are due to the
non-uniqueness of the sample mean and the approximation errors of the graduated
assignment algorithm. The second observation to be made from Table 2 is that elk
outperforms std with respect to computation time on the letter, grec, and
fingerprint data sets. On the molecule data set, std and elk have comparable
speed performance. Remarkably, elk requires slightly more distance calculations
than std on this data set (14.5 vs. 14.3 x10^3 in total).
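
For the three data sets on which elk is faster, the total speedup factors can be
recomputed from the total numbers of matchings reported in Table 2. The short
sketch below does this; the dictionary totals is introduced here for
illustration, and its values are copied from the rows labeled total.

```python
# Total matchings (x10^3) of std and elk, copied from Table 2.
totals = {
    "letter": (488.4, 42.5),
    "grec": (197.5, 63.1),
    "fingerprint": (549.0, 52.4),
}

# Speedup = matchings of std divided by matchings of elk.
speedup = {name: round(std / elk, 1) for name, (std, elk) in totals.items()}
print(speedup)  # {'letter': 11.5, 'grec': 3.1, 'fingerprint': 10.5}
```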

Contrasting the silhouette index and the dimensionality of the data with the
speedup factor gained by elk, we make the following observations. First, the
silhouette indices for the letter, grec, and fingerprint data sets are roughly
comparable and indicate a cluster structure in the data, whereas the silhouette
index for the molecule data set indicates almost no compact and homogeneous
cluster structure. Second, the dimensionality of the vector representations is
largest for molecule graphs, moderate for grec graphs, and relatively low for
letter and fingerprint graphs. Thus, the speedup factor of elk and gvr apparently
decreases with increasing dimensionality and decreasing cluster structure. This
behavior is in line with findings in high-dimensional vector spaces [5].
According to [20], there will be little or no acceleration in high dimensions if
there is no underlying structure in the data. This view is also supported by
theoretical results from computational geometry [12].



5.4 Speedup vs. Number k of Centroids 



In this experiment we investigate how the speedup factor of elk depends on
the number k of centroids. For this, we restricted ourselves to subsets of the
letter and fingerprint data sets. From the letter data set we selected 200 graphs
uniformly distributed over the four classes A, E, F, and H. From the fingerprint
data set we compiled a subset of 300 graphs uniformly distributed over all three
classes. For each chosen number k of centroids, 10 runs of each algorithm were
conducted and the average of all performance measures was taken. The values of k
are shown in Table 3 for letter graphs and Table 4 for fingerprint graphs.

From the results shown in Tables 3 and 4, we see that the speedup factor
slowly increases with increasing number k of centroids. The results confirm that
std and elk perform comparably with respect to solution quality for varying k. As
an aside, all k-means algorithms for graphs exhibit a well-behaved performance
in the sense that subgradient methods applied to the nonsmooth cluster objective
J indeed minimize J in a reasonable way, as shown by the decreasing error
for increasing k.



data set       k   measure                  std      elk

letter          4  error                    6.9      7.0
(A, E, F, H)       accuracy                0.61     0.60
                   silhouette              0.26     0.25
                   iterations                14       15
                   matchings (x10^2)
                     per iteration         10.0      4.4
                     total                140.0     65.5
                   speedup
                     per iteration          1.0      2.3
                     total                  1.0      2.1

letter          8  error                    4.2      4.2
(A, E, F, H)       accuracy                0.82     0.82
                   silhouette              0.30     0.30
                   iterations                13       14
                   matchings (x10^2)
                     per iteration         18.0      5.1
                     total                234.0     71.3
                   speedup
                     per iteration          1.0      3.5
                     total                  1.0      3.3

letter         12  error                    2.7      2.7
(A, E, F, H)       accuracy                0.94     0.94
                   silhouette              0.31     0.30
                   iterations                16       18
                   matchings (x10^2)
                     per iteration         26.0      5.5
                     total                  416     98.7
                   speedup
                     per iteration          1.0      4.2
                     total                  1.0      4.7

letter         16  error                    2.4      2.4
(A, E, F, H)       accuracy                0.94     0.94
                   silhouette              0.22     0.23
                   iterations                16       16
                   matchings (x10^2)
                     per iteration         34.0      6.7
                     total                544.0    107.8
                   speedup
                     per iteration          1.0      5.0
                     total                  1.0      6.1



Table 3. Results of k-means clusterings on a subset of the letter graphs (A, E, F,
H) for four different values of k = 4, 8, 12, 16. Shown are the performance
measures averaged over 10 runs.



data set       k   measure                  std      elk

fingerprints    3  error                   41.3     41.5
                   accuracy                0.59     0.59
                   silhouette              0.20     0.21
                   iterations                 7        7
                   matchings (x10^2)
                     per iteration         12.0      9.5
                     total                 84.0     66.8
                   speedup
                     per iteration          1.0      1.3
                     total                  1.0      1.3

fingerprints   15  error                    8.1      6.8
                   accuracy                0.64     0.64
                   silhouette              0.32     0.33
                   iterations                10        9
                   matchings (x10^2)
                     per iteration         48.0     15.4
                     total                480.0    138.4
                   speedup
                     per iteration          1.0      3.1
                     total                  1.0      3.5



Table 4. Results of k-means clusterings on a subset of the fingerprint graphs for two
different values of k = 3 and k = 15. Shown are the performance measures
averaged over 10 runs.



6 Conclusion 



We extended Elkan's k-means from vectors to graphs. Elkan's k-means exploits 
the triangle inequality to avoid graph distance calculations. Experimental results 
show that standard and Elkan's k-means for graphs perform equally with respect 
to solution quality, but Elkan's k-means outperforms standard k-means with 
respect to speed if there is a cluster structure in the data. The speedup factor 
of the acceleration increases slightly with the number k of centroids. This
contribution is a first step in accelerating clustering algorithms that directly 
operate in the domain of graphs. Future work aims at accelerating incremental 
clustering methods. 
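
To illustrate how the triangle inequality saves distance calculations, the
following minimal sketch implements only the simplest pruning rule: if
D(c, c') >= 2 D(x, c), then D(x, c') >= D(x, c), so the distance from x to c'
need not be computed. It is not the full algorithm, which additionally
maintains upper and lower bounds across iterations. The function name
assign_with_pruning is hypothetical, dist is assumed to be a metric, and the toy
example uses real numbers instead of graphs.

```python
def assign_with_pruning(points, centers, dist):
    """Nearest-center assignment with triangle-inequality pruning.

    Returns the labels and the number of point-to-center distance
    evaluations (center-to-center distances are not included in the count).
    """
    k = len(centers)
    # Center-to-center distances, computed once per assignment pass.
    cc = [[dist(centers[a], centers[b]) for b in range(k)] for a in range(k)]
    labels, calls = [], 0
    for x in points:
        best, best_d = 0, dist(x, centers[0])
        calls += 1
        for j in range(1, k):
            # Center j cannot be strictly closer than the current best one.
            if cc[best][j] >= 2 * best_d:
                continue
            d = dist(x, centers[j])
            calls += 1
            if d < best_d:
                best, best_d = j, d
        labels.append(best)
    return labels, calls


# Toy example: two well separated groups on the real line.
pts = [0.0, 0.5, 10.0, 10.5]
labels, calls = assign_with_pruning(pts, [0.0, 10.0], lambda a, b: abs(a - b))
print(labels, calls)
```

In this example the pruned pass needs 6 point-to-center distances instead of
the 4 x 2 = 8 of a plain assignment step. The saving grows when clusters are
well separated, which is consistent with the dependence on cluster structure
observed in the experiments.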

A The Silhouette Index 

Suppose that $S = \{X_1, \ldots, X_m\}$ is a sample of $m$ patterns. Let
$C = \{C_1, \ldots, C_k\}$ be a partition of $S$ consisting of $k$ disjoint
clusters with

$$S = \bigcup_{i=1}^{k} C_i.$$

We assume that $D$ is the underlying distance function defined on $S$. The
distance between two subsets $U, U' \subseteq S$ is defined by

$$D(U, U') = \min \{ D(X, X') : X \in U, X' \in U' \}.$$

If $U = \{X\}$ consists of a singleton, we simply write $D(X, U')$ instead of
$D(\{X\}, U')$. Let

$$D_{avg}(X, U) = \frac{1}{|U|} \sum_{X' \in U} D(X, X')$$

denote the average distance between pattern $X \in S$ and subset $U \subseteq S$.
Suppose that pattern $X_i \in S$ is a member of cluster $C_{m(i)} \in C$. By
$C'_{m(i)}$ we denote the set $C_{m(i)} \setminus \{X_i\}$. For each pattern
$X_i \in S$ let

$$a_i = D_{avg}(X_i, C'_{m(i)})$$

be the average distance between pattern $X_i$ and subset $C'_{m(i)}$. By

$$b_i = \min_{j \neq m(i)} D_{avg}(X_i, C_j)$$

we denote the minimum average distance between pattern $X_i$ and all clusters
from $C$ not containing $X_i$. The silhouette width of $X_i$ is defined as

$$s_i = \frac{b_i - a_i}{\max(a_i, b_i)}.$$

The silhouette of cluster $C_j \in C$ is given by

$$S_j = \frac{1}{|C_j|} \sum_{X_i \in C_j} s_i.$$

The silhouette index is then defined as the average of all cluster silhouettes

$$SI = \frac{1}{k} \sum_{j=1}^{k} S_j.$$
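
The definitions of this appendix can be sketched in Python as follows. This is a
minimal sketch under stated assumptions: it uses the standard silhouette width
(b - a) / max(a, b), takes a precomputed distance matrix dist_matrix as input,
and assigns width 0 to patterns in singleton clusters, a convention the text
does not specify. The function name silhouette_index is introduced here for
illustration.

```python
def silhouette_index(dist_matrix, labels):
    """Average of the cluster silhouettes for a given partition.

    dist_matrix[i][j] is the distance between patterns i and j;
    labels[i] is the cluster of pattern i. Requires at least two clusters.
    """
    clusters = {}
    for i, c in enumerate(labels):
        clusters.setdefault(c, []).append(i)
    if len(clusters) < 2:
        raise ValueError("the silhouette index needs at least two clusters")

    def d_avg(i, members):
        return sum(dist_matrix[i][j] for j in members) / len(members)

    cluster_silhouettes = []
    for c, members in clusters.items():
        widths = []
        for i in members:
            own = [j for j in members if j != i]
            if not own:
                widths.append(0.0)  # singleton cluster (assumed convention)
                continue
            a = d_avg(i, own)
            b = min(d_avg(i, other) for c2, other in clusters.items() if c2 != c)
            denom = max(a, b)
            widths.append((b - a) / denom if denom > 0 else 0.0)
        cluster_silhouettes.append(sum(widths) / len(widths))
    return sum(cluster_silhouettes) / len(cluster_silhouettes)


# Toy example: four points on the real line forming two compact clusters.
xs = [0.0, 1.0, 10.0, 11.0]
dm = [[abs(p - q) for q in xs] for p in xs]
si = silhouette_index(dm, [0, 0, 1, 1])
print(round(si, 4))
```

For the toy partition the index is close to 1, reflecting two compact and well
separated clusters. Swapping labels so that each cluster mixes near and far
points yields a negative value.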
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